Targeted capture and massively parallel sequencing of 12 human exomes

Authors

  • Sarah B. Ng
  • Emily H. Turner
  • Peggy D. Robertson
  • Steven D. Flygare
  • Abigail W. Bigham
  • Choli Lee
  • Tristan Shaffer
  • Michelle Wong
  • Arindam Bhattacharjee
  • Evan E. Eichler
  • Michael Bamshad
  • Deborah A. Nickerson
  • Jay Shendure
Abstract

Genome-wide association studies suggest that common genetic variants explain only a modest fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. Although DNA sequencing costs have fallen markedly, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. We have therefore sought to develop second-generation methods for targeted sequencing of all protein-coding regions ('exomes'), to reduce costs while enriching for discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of 12 humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman–Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases of coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of nonsynonymous variants by predicted functional impact.
Protein-coding regions constitute ~1% of the human genome or ~30 megabases (Mb), split across ~180,000 exons. A brute-force approach to exome sequencing with conventional technology is expensive relative to what may be possible with second-generation platforms. However, the efficient isolation of this fragmentary genomic subset is technically challenging. The enrichment of an exome by hybridization of shotgun libraries constructed from 140 μg of genomic DNA to seven microarrays was described previously.
To improve the practicality of hybridization capture, we developed a protocol to enrich for coding sequences at a genome-wide scale starting with 10 μg of DNA and using two microarrays. Our initial target was 27.9 Mb of coding sequence defined by CCDS (the NCBI Consensus Coding Sequence database). This curated set avoids the inclusion of spurious hypothetical genes that contaminate broader exome definitions. The target is reduced to 26.6 Mb on exclusion of regions that are poorly mapped with our anticipated read length owing to paralogous sequences elsewhere in the genome (Supplementary Data 1). We captured and sequenced the exomes of eight individuals previously characterized by the HapMap and Human Genome Structural Variation projects. We also analysed four unrelated individuals affected with Freeman–Sheldon syndrome (FSS; Online Mendelian Inheritance in Man (OMIM) #193700), also called distal arthrogryposis type 2A, a rare autosomal dominant disorder caused by mutations in MYH3 (ref. 5). Unpaired, 76 base-pair (bp) reads from post-enrichment shotgun libraries were aligned to the reference genome. On average, 6.4 gigabases (Gb) of mappable sequence was generated per individual (20-fold less than whole genome sequencing with the same platform), and 49% of reads mapped to targets (Supplementary Table 1). After removing duplicate reads that represent potential polymerase chain reaction artefacts, the average fold-coverage of each exome was 51× (Supplementary Fig. 1). On average per exome, 99.7% of targeted bases were covered at least once, and 96.3% (25.6 Mb) were covered sufficiently for variant calling (≥8× coverage and Phred-like consensus quality ≥30). This corresponded to 78% of genes having >95% of their coding bases called (Supplementary Fig. 2 and Supplementary Data 2). The average pairwise correlation coefficient between individuals for gene-by-gene coverage was 0.87, consistent with systematic bias in coverage between individual exomes.
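The criterion for a base being covered sufficiently for variant calling (at least 8-fold coverage and a Phred-like consensus quality of at least 30) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Position` record, the function names, and the toy depth and quality values are hypothetical and chosen only for illustration. The last lines check that 96.3% of the 26.6 Mb target does correspond to the reported 25.6 Mb.

```python
from dataclasses import dataclass

MIN_COVERAGE = 8   # minimum fold-coverage stated in the text
MIN_QUALITY = 30   # minimum Phred-like consensus quality stated in the text


@dataclass
class Position:
    """Hypothetical per-base summary; not a structure from the paper."""
    coverage: int  # reads overlapping the base after duplicate removal
    quality: int   # Phred-like consensus quality at the base


def is_callable(pos):
    """True if the base meets both thresholds for variant calling."""
    return pos.coverage >= MIN_COVERAGE and pos.quality >= MIN_QUALITY


def callable_fraction(positions):
    """Fraction of targeted bases that meet both thresholds."""
    positions = list(positions)
    if not positions:
        return 0.0
    return sum(is_callable(p) for p in positions) / len(positions)


# Toy data (made up): two bases pass, one fails on depth, one on quality.
toy = [Position(51, 60), Position(8, 30), Position(7, 45), Position(20, 29)]
print(callable_fraction(toy))

# Consistency check on the reported figures: 96.3% of the 26.6 Mb target.
target_mb = 26.6
print(round(0.963 * target_mb, 1))
```

Both thresholds are applied as inclusive lower bounds, so a base at exactly 8-fold coverage and quality 30 counts as callable.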
False positives and false negatives are critical issues in genomic resequencing. We assessed the quality of our exome data in four ways. First, comparing sequence-based calls for the eight HapMap exomes to array-based genotyping, we observed a high concordance with both homozygous (99.94%; n = 219,077) and heterozygous (99.57%; n = 43,070) genotypes (Table 1). Second, we compared our coding single-nucleotide polymorphism (cSNP) catalogue to ~1 Mb of coding sequence determined in each of the eight HapMap individuals by molecular inversion probe (MIP) capture and direct resequencing. At coordinates called in both data sets, 99.9% of all cSNPs (n = 4,620) and 100% of novel cSNPs (n = 334) identified here were concordant, consistent with a low false discovery rate. Third, we compared the NA18507 cSNPs identified here to those called by recent whole genome sequencing of this individual, and found substantial overlap (Supplementary Fig. 3). The relative numbers of cSNPs called by only one approach, and the proportions of these represented in dbSNP, indicate that exome sequencing has equivalent sensitivity for cSNP detection compared to whole genome sequencing. Fourth, we compared our data to cSNPs in high-quality Sanger sequence of single haplotype regions from fosmid clones of the same HapMap individuals. Most fosmid-defined cSNPs (38 of 40) were at coordinates with sufficient coverage in our data for variant calling. Of these, 38 of 38 were correctly identified as variant. A comparison of our data to past reports on exonic or exomic array-based capture revealed roughly equivalent capture specificity, but greater completeness in terms of coverage and variant calling (Supplementary Table 2). These improvements probably arise from a combination of greater sequencing depth and differences in array designs and in experimental conditions for capture. Within the set of

Related articles

Targeted-capture massively-parallel sequencing enables robust detection of clinically informative mutations from formalin-fixed tumours

Massively parallel sequencing offers the ability to interrogate a tumour biopsy for multiple mutational changes. For clinical samples, methodologies must enable maximal extraction of available sequence information from formalin-fixed and paraffin-embedded (FFPE) material. We assessed the use of targeted capture for mutation detection in FFPE DNA. The capture probes targeted the coding region of...

Exome sequencing: the sweet spot before whole genomes

The development of massively parallel sequencing technologies, coupled with new massively parallel DNA enrichment technologies (genomic capture), has allowed the sequencing of targeted regions of the human genome in rapidly increasing numbers of samples. Genomic capture can target specific areas in the genome, including genes of interest and linkage regions, but this limits the study to what is...

The minimal amount of starting DNA for Agilent’s hybrid capture-based targeted massively parallel sequencing

Targeted capture massively parallel sequencing is increasingly being used in clinical settings, and as costs continue to decline, use of this technology may become routine in health care. However, a limited amount of tissue has often been a challenge in meeting quality requirements. To offer a practical guideline for the minimum amount of input DNA for targeted sequencing, we optimized and eval...

Performance of Microarray and Liquid Based Capture Methods for Target Enrichment for Massively Parallel Sequencing and SNP Discovery

Targeted sequencing is a cost-efficient way to obtain answers to biological questions in many projects, but the choice of the enrichment method to use can be difficult. In this study we compared two hybridization methods for target enrichment for massively parallel sequencing and single nucleotide polymorphism (SNP) discovery, namely Nimblegen sequence capture arrays and the SureSelect liquid-b...

The use of targeted genomic capture and massively parallel sequencing in diagnosis of Chinese Leukoencephalopathies

Leukoencephalopathies are diseases with high clinical heterogeneity. In clinical work, it's difficult for doctors to make a definite etiological diagnosis. Here, we designed a custom probe library which contains the known pathogenic genes reported to be associated with Leukoencephalopathies, and performed targeted gene capture and massively parallel sequencing (MPS) among 49 Chinese patients wh...

Publication date: 2009